Abstract


In this paper, we propose a novel word-alignment-based method to solve
the FAQ-based question answering task. First, we employ a neural network
model to calculate question similarity, where the word alignment between
two questions is used for extracting features. Second, we design a
bootstrap-based feature extraction method to extract a small set of
effective lexical features. Third, we propose a learning-to-rank
algorithm to train parameters more suitable for the ranking tasks.
Experimental results, conducted on three languages (English, Spanish and
Japanese), demonstrate that the question similarity model is more
effective than baseline systems, the sparse features bring 5%
improvements on top-1 accuracy, and the learning-to-rank algorithm works
significantly better than the traditional method. We further evaluate
our method on the answer sentence selection task. Our method outperforms
all the previous systems on the standard TREC data set.


1 Introduction 


Question Answering (QA) aims to automatically understand natural
language questions and to respond with actual answers. State-of-the-art
QA systems usually work relatively well for factoid, list and definition
questions, but they do not necessarily work well for real-world
questions, where more comprehensive answers are required. Frequently
Asked Questions (FAQ) based QA is an economical and practical solution
for general QA (Burke et al., 1997). Instead of answering questions from
scratch, FAQ-based QA searches the FAQ archives and checks whether a
similar question was previously asked. If a similar question is found,
the corresponding answer is returned to the user. The FAQ archives are
usually created by experts, so the returned answers are usually of
higher quality.


The core of FAQ-based QA is to calculate semantic similarities between
questions. This is a very challenging task, because two questions that
share the same meaning may be quite different at the word or syntactic
level. For example, “How do I add a vehicle to this policy?” and “What
should I do to extend this policy for my new car?” have few words in
common, but they share the same answer. In the past two decades, many
efforts have been made to tackle this lexical gap problem. One type of
method tries to bridge the lexical gap by utilizing semantic lexicons,
like WordNet (Burke et al., 1997; Wu et al., 2005; Yang et al., 2007).
Another type treats this task as a statistical machine translation
problem, and employs a parallel question set to learn word-to-word or
phrase-to-phrase translation probabilities (Berger and Lafferty, 1999;
Jeon et al., 2005; Lee et al., 2008; Xue et al., 2008; Bernhard and
Gurevych, 2009; Zhou et al., 2011). Both of these methods have
drawbacks. The first method is hard to adapt to many other languages,
because a semantic lexicon is often unavailable. For the second method,
a large parallel question set is required to learn the translation
probabilities, which is usually hard or expensive to acquire. To
overcome these drawbacks, we utilize distributed word representations to
calculate the similarity between words; these representations can be
easily trained using only monolingual data.


In this paper, we propose a novel word-alignment-based method to solve
the FAQ-based QA task. The characteristics of our method include: (1) A
neural network model for calculating question similarity with word
alignment features. For an input question and a candidate question, the
similarity of each word pair (between the two questions) is calculated
first, and then the best word alignment for the two questions is
computed. We extract a vector of dense features from the word alignment,
then feed the feature vector into a neural network and calculate the
question similarity in the network’s output layer. (2) A bootstrap-based
feature extraction method. The FAQ archives usually contain fewer than a
few hundred questions, so to avoid overfitting we cannot use too many
sparse features. Therefore, we devise this method to extract a small set
of effective sparse features according to our system’s ranking results.
(3) A learning-to-rank algorithm for training. The FAQ-based QA task is
essentially a ranking task: our model not only needs to calculate a
proper similarity for each question pair, but also needs to rank the
most relevant one on top of the other candidates. So we propose a
learning-to-rank method to train parameters more suitable for ranking.
Experimental results, conducted on FAQ archives in three languages,
demonstrate that our method is very effective. We also evaluate our
method on the answer sentence selection task. Experimental results on
the standard TREC data set show that our method outperforms all previous
state-of-the-art systems.

Figure 1: Word alignment example.

2 Method 

The task of FAQ-based QA is, given a query question, to rank all the
candidate questions according to their similarities to the query. We
define the similarity between a query $Q$ and a candidate $C$ as
$sim(Q, C) = f(X)$, where $X$ is a feature vector extracted from the
$(Q, C)$ pair, and $f(\cdot)$ is a linear or non-linear function. In
this work, we represent $f(\cdot)$ as a neural network model, where the
input is the feature vector $X$, and the output layer contains only one
neuron, which predicts the similarity of the two questions. We choose
the sigmoid activation function for the output layer, so that the
similarity is constrained to the range [0, 1]. To apply this model, two
questions remain: (1) How to represent the feature vector $X$? (2) How
to train the model for ranking?
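
As an illustration of the scoring function $f(X)$, the following minimal Python sketch computes a similarity with one hidden layer and a single sigmoid output neuron. It is a sketch under stated assumptions rather than the implementation used in the experiments: the tanh hidden activation, the function names, and the toy weights are all hypothetical, since the text only fixes the sigmoid output.

```python
import math

def sigmoid(z):
    """Logistic function; maps a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def similarity(x, hidden_w, hidden_b, out_w, out_b):
    """Compute sim(Q, C) = f(X) with one hidden layer and one sigmoid output.

    x         feature vector X extracted from the (Q, C) pair
    hidden_w  one weight row per hidden neuron
    hidden_b  one bias per hidden neuron
    out_w     one output weight per hidden neuron
    out_b     bias of the single output neuron
    ASSUMPTION: tanh is used for the hidden layer; the text does not say
    which hidden activation is used.
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)

# Toy example with three features and two hidden neurons (made-up weights).
x = [0.8, 0.1, 0.2]
score = similarity(
    x,
    hidden_w=[[1.0, -1.0, 0.5], [0.2, 0.0, -0.3]],
    hidden_b=[0.0, 0.1],
    out_w=[1.5, -0.5],
    out_b=0.0,
)
```

Whatever the weights, the sigmoid keeps the score between 0 and 1, which is what allows it to be read as a similarity.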

2.1 Feature Definition 

For the first question, we propose to extract features from the word
alignment between two questions. Let us denote the query question as
$Q = q_0, q_1, \ldots, q_i, \ldots, q_m$ and the candidate question as
$C = c_0, c_1, \ldots, c_j, \ldots, c_n$, where $q_i$ and $c_j$
represent words in the questions. First, we calculate the word
similarity between $q_i$ and $c_j$ according to the cosine similarity of
their distributed representations:

$$sim(q_i, c_j) = \max(0, cosine(v_{q_i}, v_{c_j}))$$

Then, we obtain the similarity matrix for the question pair by
calculating the similarities of all word pairs, e.g. Figure 1(a).
Finally, we compute the best word alignment for the question pair based
on the similarity matrix, e.g. Figure 1(b) shows the word alignment
computed from Figure 1(a).
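
The similarity matrix and the alignment step can be sketched in Python as follows. This is only a sketch: the text does not specify how the best alignment is searched, so the code assumes that each query word is aligned to its highest-scoring candidate word and is left unaligned when all of its similarities are zero. The function names and the toy two-dimensional word vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(query_vecs, cand_vecs):
    """sim(q_i, c_j) = max(0, cosine(v_qi, v_cj)) for every word pair."""
    return [[max(0.0, cosine(q, c)) for c in cand_vecs] for q in query_vecs]

def align(sim):
    """Return one (align_i, sim_i) pair per query word.

    ASSUMPTION: greedy row-wise argmax; a query word whose best similarity
    is 0 is treated as unaligned and gets (None, 0.0). The candidate is
    assumed to contain at least one word.
    """
    result = []
    for row in sim:
        j = max(range(len(row)), key=row.__getitem__)
        result.append((j, row[j]) if row[j] > 0 else (None, 0.0))
    return result

# Toy vectors: three query words, two candidate words.
query_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cand_vecs = [[0.0, 2.0], [3.0, 0.0]]
sim = similarity_matrix(query_vecs, cand_vecs)
alignment = align(sim)
```

In this toy case the first two query words align to the second and first candidate words respectively, and the third query word stays unaligned because its cosine with every candidate word is non-positive.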

We define some dense features based on the word alignment. Let us denote
the alignment position for each query word $q_i$ as $align_i$, and the
corresponding alignment score as $sim_i$. For example, for word $q_s$ in
Figure 1, $align_s = 4$ and $sim_s = 0.6$. We denote an unaligned word
as $unalign_i$. We also take into account the importance of each word by
employing its inverse document frequency (IDF) score, denoted $idf_i$.
We define the following dense features:

• similarity: $f_0 = \sum_i sim_i \cdot idf_i / \sum_i idf_i$. This
feature measures the question similarity based on the aligned words.

• dispersion: $f_1 = \sum_i (|align_i - align_{i-1} - 1|)^2$. This
feature prefers candidates where contiguous query words are aligned to
contiguous words in the candidate.

• penalty: $f_2 = \sum_{unalign_i} idf_i / \sum_i idf_i$. This feature
penalizes candidates based on the unaligned query words.

• 5 important words: This feature type contains 5 features. Each feature
is the alignment score of the i-th most important word, where the
importance of a word is measured by its IDF score.

• reverse: Extract the above features again after swapping the roles of
the query and candidate questions.
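
The dense features listed above can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: dispersion is taken to be the sum of squared values of |align_i - align_{i-1} - 1|, unaligned words are skipped when computing dispersion, and the "important words" features are padded with zeros when the query has fewer than five words. The function name and the toy input are hypothetical.

```python
def dense_features(alignment, idf, top_k=5):
    """Compute [f0, f1, f2, important_1, ..., important_k] for one direction.

    alignment  list of (align_i, sim_i) per query word; align_i is None
               for an unaligned word
    idf        list of IDF scores, one per query word (assumed positive)
    """
    total_idf = sum(idf)

    # similarity: IDF-weighted average of the alignment scores
    f0 = sum(s * w for (_, s), w in zip(alignment, idf)) / total_idf

    # dispersion: penalize jumps between consecutive aligned positions
    # ASSUMPTION: unaligned words are skipped
    positions = [a for a, _ in alignment if a is not None]
    f1 = sum(abs(cur - prev - 1) ** 2
             for prev, cur in zip(positions, positions[1:]))

    # penalty: share of IDF mass carried by unaligned query words
    f2 = sum(w for (a, _), w in zip(alignment, idf) if a is None) / total_idf

    # alignment scores of the top_k query words ranked by IDF
    # ASSUMPTION: padded with 0.0 for short queries
    order = sorted(range(len(idf)), key=lambda i: idf[i], reverse=True)[:top_k]
    important = [alignment[i][1] for i in order]
    important += [0.0] * (top_k - len(important))

    return [f0, f1, f2] + important

# Toy query of four words; the third word is unaligned.
alignment = [(0, 1.0), (1, 0.5), (None, 0.0), (4, 0.5)]
idf = [1.0, 2.0, 1.0, 4.0]
features = dense_features(alignment, idf)
```

The "reverse" features would be obtained by calling the same function on the alignment computed with the candidate in the role of the query, and concatenating the two feature lists.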

We also define some sparse lexical features. Because our FAQ archives
contain fewer than a few hundred questions, we cannot extract too many
sparse features; otherwise our model will overfit the training set. We
design a bootstrap-based feature extraction method to extract a small
set of effective lexical features according to the model’s ranking
results. The input to our method consists of a seed model, an FAQ
archive and a set of sparse feature templates. For each query question,
the workflow is:

• Step 1: Rank all the candidates with the seed model. If the rank-1
candidate is relevant, return without doing anything.

• Step 2: Find the first relevant candidate $C^+$ in the ranking list,
and collect all the irrelevant candidates $\{C^-\}$ ranked above $C^+$.

• Step 3: Collect sparse features $F^+$ from $C^+$, and sparse features
$F^-$ from $\{C^-\}$. Then, keep only the sparse features that occur in
$F^+$ but not in $F^-$.

We use this method to extract a group of sparse features, then add these
features to the model and retrain it. This procedure can be iterated
until the performance stabilizes. Our feature templates are: aligned
query words, aligned candidate words and aligned query-candidate word
pairs. In our experiments, the performance converges within 10
iterations, and the final model usually contains fewer than 1,500 sparse
features.
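
One pass of the three-step workflow for a single query can be sketched in Python as follows. This is a sketch under assumptions, not the exact procedure used in the experiments: the function name and its callback parameters are hypothetical, and the toy example uses plain candidate words as sparse features instead of the aligned-word templates described above.

```python
def bootstrap_features(ranked, is_relevant, extract):
    """Return the sparse features selected for one query.

    ranked       candidates sorted by the seed model, best first
    is_relevant  function: candidate -> bool
    extract      function: candidate -> set of sparse features
    """
    # Step 1: nothing to learn if the top candidate is already relevant.
    if not ranked or is_relevant(ranked[0]):
        return set()

    # Step 2: locate the first relevant candidate C+ and the irrelevant
    # candidates {C-} ranked above it.
    first_rel = next(
        (k for k, cand in enumerate(ranked) if is_relevant(cand)), None
    )
    if first_rel is None:
        # ASSUMPTION: a query with no relevant candidate yields no features.
        return set()

    # Step 3: keep the features seen in F+ but not in F-.
    positive = extract(ranked[first_rel])
    negative = set()
    for cand in ranked[:first_rel]:
        negative |= extract(cand)
    return positive - negative

# Toy example: the relevant candidate is ranked third by the seed model.
ranked = ["cancel my policy", "add a driver", "add a vehicle to policy"]
relevant = {"add a vehicle to policy"}
selected = bootstrap_features(
    ranked,
    is_relevant=lambda cand: cand in relevant,
    extract=lambda cand: set(cand.split()),  # hypothetical feature template
)
```

Here only the features that distinguish the relevant candidate from the wrongly preferred ones survive, which keeps the sparse feature set small. In the full procedure the selected features from all queries would be added to the model, the model retrained, and the pass repeated.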


2.2 Learning to Rank Algorithm 

The traditional method for training the similarity model is to cast the
task as a binary classification problem (Heilman and Smith, 2010;
Severyn and Moschitti, 2013; Yao et al., 2013). All the possible
question pairs are collected from the training set; a relevant question
pair is assigned the label “+1”, and any other pair is assigned the
label “-1”. Then the model is trained to optimize the classification
accuracy. However, the FAQ-based QA task is essentially a ranking
problem. The similarity model not only needs to calculate a proper
similarity for each question pair, but also needs to rank the most
relevant candidate on top of the others. Therefore, we propose a novel
learning-to-rank algorithm to explicitly optimize the ranking (top-1)
accuracy. We define the loss function for each query $Q_i$ and all its
irrelevant candidates $\{C_j\}$ as:

$$l_i = \underbrace{\sum_j \max(0, \epsilon + sim(Q_i, C_j) - sim(Q_i, C_{j^*}))}_{term_1} \; \underbrace{- \, sim(Q_i, C_{j^*})}_{term_2}$$

where $\epsilon$ is a margin, and $C_{j^*}$ is the first relevant
candidate in the ranking list. $term_1$ aims to decrease the
similarities of the irrelevant candidates that are ranked above
$C_{j^*}$, or below $C_{j^*}$ but with a margin less than $\epsilon$,
and $term_2$ aims to increase the similarity of $C_{j^*}$. We utilize
the back propagation algorithm (Rumelhart et al., 1988) to minimize the
loss function over the training set, and employ the AdaGrad strategy
(Duchi et al., 2011). In our experiments, we set $\epsilon$ to 0.03 and
the learning rate to 0.1.
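
The per-query loss can be sketched in Python as follows. The sketch assumes that the hinge term is summed over all irrelevant candidates of the query, which is how "all its irrelevant candidates" is read here; the function name and the toy similarity values are hypothetical, and gradient computation and the AdaGrad update are left out.

```python
def ranking_loss(rel_sim, irrelevant_sims, margin=0.03):
    """Loss for one query.

    rel_sim          similarity of the first relevant candidate C_j*
    irrelevant_sims  similarities of the irrelevant candidates C_j
    margin           the margin epsilon (0.03 in the experiments)
    """
    # term 1: active only for irrelevant candidates that score above C_j*
    # or fall below it by less than the margin
    term1 = sum(max(0.0, margin + s - rel_sim) for s in irrelevant_sims)
    # term 2: rewards a high similarity for C_j*
    term2 = -rel_sim
    return term1 + term2

# Toy example: one irrelevant candidate outranks C_j*, one is within the
# margin, and one is far below and contributes nothing.
loss = ranking_loss(0.6, [0.7, 0.59, 0.2])
```

In the toy example the three hinge terms are about 0.13, 0.02 and 0, so the loss is approximately 0.15 - 0.6 = -0.45. Candidates that are already separated from $C_{j^*}$ by more than the margin receive no gradient, which focuses training on the top of the ranking.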


3 Experiment 

3.1 Experimental Setting 

We conducted experiments on FAQ archives in three languages (English,
Spanish and Japanese). These FAQ archives were collected from the
customer service or online Q&A webpages of three companies. The number
of question-answer pairs in each archive is 987 (English), 687 (Spanish)
and 2384 (Japanese). Each question may have more than one relevant
question within the archive. We split each archive into train, dev and
test sets. The numbers of questions in the three sets are 790/99/98
(English), 549/69/69 (Spanish) and 1684/350/350 (Japanese). To train
distributed word representations and calculate IDF scores, we employed
the English Gigaword (LDC2011T07), the Spanish Gigaword (LDC2011T12) and
an in-house Japanese corpus (about 2 billion tokens). These corpora were
preprocessed by our in-house tokenizer, and the word2vec toolkit^1 was
used for training the distributed word representations.

3.2 Characteristics of Our Model 

First, we conducted a group of experiments by incrementally adding
features. We used the learning-to-rank algorithm to train the models,
but did not use a hidden layer for these models. Figure 2(a) shows the
top-1 accuracies on the dev sets.

^1 https://code.google.com/p/word2vec/

Figure 2: Characteristics of our model.


We found that the “dispersion” and “penalty” features are effective for
English. The “five important words” features are helpful for both
Spanish and Japanese. The “reverse” feature is very useful for English
and Japanese. The “sparse” features bring around a 5% improvement for
all languages.

Second, we verified the effectiveness of our learning-to-rank algorithm.
We built two systems: the first one takes only the dense features, and
the second one takes both dense and sparse features. Each system was
trained with both the traditional classification method and our
learning-to-rank method. Experimental results on the dev sets are given
in Figure 2(b). We see that the learning-to-rank method works
consistently better than the classification method.

Third, we examined the influence of the neural network structure by
changing the number of hidden neurons. We found that the model achieved
the best performance when using 300 hidden neurons. Figure 2(c) shows
the performance of two systems on the dev sets. The first system has no
hidden layer, and the second one employs 300 hidden neurons. We find
that using the hidden layer is helpful for the final performance.

3.3 Evaluation on the Test Set

In this section, we evaluate our systems on the test sets. We tested
three systems: (1) “Dense” takes the dense features, (2) “Sparse” takes
both dense and sparse features, and (3) “SparseHidden” adds 300 hidden
neurons to the second system. We also designed three baseline systems:
(1) “BagOfWord” calculates question similarity by counting how many
query words also occur in the candidate; (2) “IDF-VSM” represents each
question with a vector space model (VSM), where each dimension is the
IDF score of the corresponding word, and uses the cosine similarity as
the question similarity; (3) “Similarity” uses only our “similarity”
feature. Table 1 gives the top-1 accuracies. The “BagOfWord” and
“IDF-VSM” systems only count exactly matched words, so they did not work
well. The “Similarity” system obtained some improvement by matching
words with distributed word representations. The performance of the
baseline systems also shows the difficulty of this task. By adding dense
and sparse features extracted from the word alignment, our systems
significantly outperformed the baseline systems.
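
The two exact-match baselines can be sketched in Python as follows. The sketch assumes that “IDF-VSM” uses a binary bag-of-words vector in which a present word contributes its IDF score and term frequency is ignored; the function names and the toy IDF values are hypothetical.

```python
import math

def bag_of_word(query, candidate):
    """Count how many query words also occur in the candidate."""
    cand = set(candidate)
    return sum(1 for w in query if w in cand)

def idf_vsm(query, candidate, idf):
    """Cosine between the IDF-weighted word vectors of two questions.

    ASSUMPTION: each dimension is idf[w] if the word occurs in the
    question and 0 otherwise; unknown words get an IDF of 0.
    """
    qs, cs = set(query), set(candidate)
    dot = sum(idf.get(w, 0.0) ** 2 for w in qs & cs)
    norm_q = math.sqrt(sum(idf.get(w, 0.0) ** 2 for w in qs))
    norm_c = math.sqrt(sum(idf.get(w, 0.0) ** 2 for w in cs))
    return dot / (norm_q * norm_c) if norm_q > 0 and norm_c > 0 else 0.0

# Toy example: "vehicle" and "car" do not match exactly, so neither
# baseline gives credit for them.
query = ["add", "a", "vehicle"]
candidate = ["add", "a", "car"]
idf = {"add": 1.0, "a": 0.0, "vehicle": 2.0, "car": 2.0}
overlap = bag_of_word(query, candidate)
vsm_score = idf_vsm(query, candidate, idf)
```

The example illustrates the lexical gap: the most informative words of the two questions are synonyms, yet the IDF-weighted cosine stays low (about 0.2) because only exact matches count.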


3.4 Evaluation on Answer Sentence Selection 


To compare with other state-of-the-art systems, we further evaluated our
system on the answer sentence selection task with the standard TREC data
set (Voorhees and Tice, 1999). The task is to rank candidate answers for
each question, which is very similar to our FAQ-based QA task. We used
the same experimental setup as Wang et al. (2007), and evaluated the
results with Mean Average Precision (MAP) and Mean Reciprocal Rank
(MRR). Table 2 shows the performance of our system and the
state-of-the-art systems. Our system achieves a significant improvement
over the other systems. Therefore, our method is quite effective for
this kind of ranking task.
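
For reference, the two evaluation measures can be sketched in Python as follows. These are the standard textbook definitions, written under the assumption that a question with no relevant candidate scores 0; the official evaluation scripts for this data set may treat such questions differently. The function names and the toy rankings are hypothetical.

```python
def average_precision(labels):
    """Average precision of one ranked list of relevance flags."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    """Arithmetic mean of a non-empty iterable."""
    values = list(values)
    return sum(values) / len(values)

# Two toy questions; each list gives the relevance of the ranked candidates.
rankings = [[False, True, True], [True, False]]
map_score = mean(average_precision(r) for r in rankings)
mrr_score = mean(reciprocal_rank(r) for r in rankings)
```

For the toy rankings, the average precisions are 7/12 and 1, giving a MAP of 19/24, and the reciprocal ranks are 1/2 and 1, giving an MRR of 0.75.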


4 Conclusion 

In this paper, we propose a question similarity model to extract
features from word alignment between two questions. We also come up with
a bootstrap-based feature extraction method to extract a small set of
effective lexical features. By training the model with our
learning-to-rank algorithm, the model works very well for both the
FAQ-based QA task and the answer sentence selection task.


System          English   Spanish   Japanese
BagOfWord       31.63     36.23     55.71
IDF-VSM         37.76     37.68     58.29
Similarity      41.84     40.58     67.43
Dense           45.92     44.90     67.14
Sparse          51.02     50.72     69.43
SparseHidden    52.04     59.42     70.29

Table 1: Evaluation on the test set


System                          MAP     MRR
Wang et al. (2007)              0.603   0.685
Heilman and Smith (2010)        0.609   0.692
Yao et al. (2013)               0.631   0.748
Severyn and Moschitti (2013)    0.678   0.736
Yih et al. (2013)               0.709   0.770
Yu et al. (2014)                0.711   0.785
Our Method                      0.746   0.820

Table 2: Evaluation on Answer Sentence Selection